feat(llm-access): keyword-based session moderation gate#63

Merged: acking-you merged 11 commits into master from feat/keyword-moderation-gate on Jul 4, 2026
@acking-you

Owner

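Overview

Adds a keyword moderation module to the Kiro / Codex gateway that runs before upstream dispatch. When the request body matches a configured keyword, the corresponding session is banned, and both the current request and all subsequent requests are blocked. On a hit, the full request body plus the redacted request headers are recorded once for review, and a reviewer can unban a session that was banned by mistake.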

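Requirements mapping

| Requirement | Implementation |
| --- | --- |
| Keyword storage/addition + matching before upstream | AdminModerationStore + migration 0036; matching is hooked in before upstream on the three dispatch paths: kiro, codex, and direct Anthropic |
| Ban the session + store the full request body, headers, and matched keyword | llm_moderation_banned_sessions (body/headers stored as JSONB, plus the matched keyword and a context snippet) |
| Admin review frontend | New Yew page /admin/llm-gateway/moderation with two tabs: keyword management and ban review |
| Hit the DB as little as possible; use caching | See "Caching / performance" below |
| Keywords support txt / json | Both txt (one word or phrase per line) and json (an array, {"keywords":[...]}, or an array of objects) are supported |
| Not a simple contains: phrase queries + an optimal algorithm | Aho-Corasick automaton; the text is normalized, and only message body content is concatenated and matched |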
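
The txt import format in the table (one word or phrase per line) can be sketched as below. This is a minimal illustration, not the PR's parser: the function name `parse_keywords_txt` is hypothetical, and skipping blank lines and collapsing case-insensitive duplicates are assumptions about the intended behavior.

```rust
use std::collections::BTreeSet;

/// Parse a plain-text keyword list: one word or phrase per line.
/// Assumption: blank lines are skipped and case-insensitive duplicates
/// are dropped, keeping the first spelling seen.
pub fn parse_keywords_txt(input: &str) -> Vec<String> {
    let mut seen = BTreeSet::new();
    let mut out = Vec::new();
    for line in input.lines() {
        let phrase = line.trim();
        if phrase.is_empty() {
            continue;
        }
        // `insert` returns false when the lowercased phrase was already seen.
        if seen.insert(phrase.to_lowercase()) {
            out.push(phrase.to_string());
        }
    }
    out
}
```

The JSON variants would need a JSON parser and are left out of this sketch.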

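Matching engine

  • Aho-Corasick scans for all keywords in a single pass. Text is normalized first (lowercased, whitespace runs collapsed to a single space), so phrases are unaffected by line breaks or repeated spaces.
  • ASCII keywords require word boundaries (ass does not match class); CJK phrases match without boundaries.
  • Only user-visible body text is extracted (Anthropic system + messages[].content; OpenAI/Codex instructions + content[].text). Structural noise such as JSON field names, model ids, and tool schemas is excluded from matching.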
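
The normalization and word-boundary rules above can be sketched with the standard library alone. This is only an illustration under stated assumptions: it swaps the Aho-Corasick automaton for a naive per-keyword substring search, and the names `normalize` and `matches_keyword` are hypothetical, not taken from the PR.

```rust
/// Lowercase the text and collapse every whitespace run to a single space.
pub fn normalize(text: &str) -> String {
    text.split_whitespace()
        .map(|word| word.to_lowercase())
        .collect::<Vec<_>>()
        .join(" ")
}

/// True if `keyword` occurs in `text` after normalization.
/// ASCII keywords must sit on word boundaries; non-ASCII (e.g. CJK)
/// keywords match anywhere. Naive search stands in for Aho-Corasick.
pub fn matches_keyword(text: &str, keyword: &str) -> bool {
    let hay = normalize(text);
    let needle = normalize(keyword);
    if needle.is_empty() {
        return false;
    }
    let need_boundary = needle.is_ascii();
    let mut from = 0;
    while let Some(rel) = hay[from..].find(needle.as_str()) {
        let start = from + rel;
        let end = start + needle.len();
        if !need_boundary {
            return true;
        }
        // A boundary is the text edge or a non-alphanumeric neighbor.
        let before_ok = hay[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !c.is_alphanumeric());
        let after_ok = hay[end..]
            .chars()
            .next()
            .map_or(true, |c| !c.is_alphanumeric());
        if before_ok && after_ok {
            return true;
        }
        // Step one character forward so overlapping candidates are not skipped.
        from = start + hay[start..].chars().next().map_or(1, |c| c.len_utf8());
    }
    false
}
```

A real automaton finds every keyword in one pass over the text; the loop here rescans per keyword and is only meant to make the boundary rule concrete.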

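Caching / performance (key design)

  • The request hot path never reads Postgres: the compiled keyword automaton and the banned/allowlisted session-key sets stay resident in memory and are refreshed only at startup and on a fixed interval.
  • An already-banned session is rejected immediately, with no scan and no database write.
  • A new ban writes to PG exactly once (ON CONFLICT DO NOTHING + in-memory dedup), and the write is persisted asynchronously via tokio::spawn so it does not block the response.
  • A session without a session identifier gets a stable key derived from the SHA-256 of the request body, so repeated retries are deduplicated both in memory and in the database.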
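
The hot-path contract above (banned sessions rejected without a scan, a new ban recorded once) can be sketched as an in-memory state machine. The types `GateState` and `Decision` are hypothetical, and treating allowlisted sessions as exempt from scanning is an assumption made for this sketch, not something the description states.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Allow,
    /// Session was already banned: no scan, no write.
    RejectBanned,
    /// First hit for this session: the caller would persist it once, off the hot path.
    RejectNewBan,
}

#[derive(Default)]
pub struct GateState {
    banned: HashSet<String>,
    allowed: HashSet<String>,
}

impl GateState {
    /// `scan` is only invoked when the session is in neither set.
    pub fn check(&mut self, session_key: &str, scan: impl FnOnce() -> bool) -> Decision {
        if self.banned.contains(session_key) {
            return Decision::RejectBanned;
        }
        // Assumption: a reviewer-allowlisted session skips the scan entirely.
        if self.allowed.contains(session_key) {
            return Decision::Allow;
        }
        if scan() {
            // Allocate the owned key only when a ban is actually recorded.
            self.banned.insert(session_key.to_string());
            return Decision::RejectNewBan;
        }
        Decision::Allow
    }

    /// Reviewer action: lift the ban and allowlist the session.
    pub fn unban(&mut self, session_key: &str) {
        self.banned.remove(session_key);
        self.allowed.insert(session_key.to_string());
    }
}
```

Only the `RejectNewBan` branch would trigger a database write in this model, which is what keeps repeat offenders and retries off Postgres.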

Storage

  • New AdminModerationStore trait + empty.rs stub + Postgres implementation.
  • Migration 0036_keyword_moderation.sql: llm_moderation_keywords and llm_moderation_banned_sessions (JSONB body/headers, with status and review indexes).

Admin API

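/admin/llm-gateway/moderation/*: keyword list / bulk import (txt/json) / delete; banned-session paginated list, detail (including the full request body and headers), and unban or keep the ban. Reuses the existing ensure_admin_access authorization.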

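Tests and quality gates

  • llm-access-core matching engine: 11 unit tests (normalization, phrase tolerance, ASCII word boundaries, CJK, txt/json parsing, body extraction), all passing.
  • llm-access gate module: 7 unit tests (session key, header redaction, body JSON wrapping, kiro/json body extraction, disabled gate), all passing.
  • cargo clippy: zero warnings across the full llm-access stack and static-flow-frontend (wasm32).
  • rustfmt applied only to the changed files.

Note: static-flow-backend cannot be fully built locally because this machine lacks protoc (a build dependency of the lance submodule), but this PR's backend change is only 2 lines of SSR route registration. The 5 usage_worker failures are a pre-existing Windows/DuckDB file-lock environment issue (reproduced on the baseline branch) and are unrelated to this change.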

Deployment surface

Changes are concentrated in llm-access*; per repository convention, the production release target is the llm-access service on AWS.

🤖 Generated with Claude Code

Add a pre-upstream keyword moderation module for the Kiro and Codex
gateways. When request content matches a configured keyword the session
is banned in memory and this plus all subsequent requests are blocked;
the full request body and (redacted) headers are captured once for
admin review, and a reviewer can unban a session.

Design highlights:
- Phrase matching via Aho-Corasick over normalized text (lowercased,
  whitespace-collapsed); ASCII keywords require word boundaries while
  CJK phrases match freely. Only user-visible message content is scanned
  (system + messages[].content / instructions), not JSON structure noise.
- Keywords import from plain-text (one phrase per line) or JSON.
- Hot path never reads Postgres: the compiled automaton plus banned /
  allowlisted session-key sets live in process memory, refreshed on
  startup and a periodic interval. Already-banned sessions are rejected
  without a scan or a write; a new ban persists exactly once (JSONB body
  + headers) via a spawned task.
- Admin API + Yew review console: manage keywords and review captured
  bans (inspect payload, keep or lift the ban).

Storage: new AdminModerationStore trait, empty stub, Postgres impl, and
migration 0036 (llm_moderation_keywords, llm_moderation_banned_sessions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a keyword moderation gate for the LLM gateway, allowing administrators to block requests containing banned keywords and review flagged sessions. It adds an admin moderation page in the frontend, backend API endpoints, database tables for keywords and banned sessions, and an in-memory ModerationGate that filters requests on the hot path. Feedback on the implementation suggests several optimizations and safety improvements: using safe string slicing to prevent panics on non-UTF-8 boundaries, removing redundant lowercase conversions on header names, optimizing digest formatting and key allocations to reduce string allocations, and adding a composite index on (banned_at_ms DESC, id DESC) to improve pagination query performance.

Comment on lines +127 to +157
fn match_context_snippet(text: &str, start: usize, end: usize) -> String {
    let snippet_start = {
        let mut cursor = start;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text[..cursor].char_indices().next_back() {
                Some((index, _)) => cursor = index,
                None => break,
            }
        }
        cursor
    };
    let snippet_end = {
        let mut cursor = end;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text[cursor..].chars().next() {
                Some(ch) => cursor += ch.len_utf8(),
                None => break,
            }
        }
        cursor
    };
    let mut snippet = String::new();
    if snippet_start > 0 {
        snippet.push('…');
    }
    snippet.push_str(&text[snippet_start..snippet_end]);
    if snippet_end < text.len() {
        snippet.push('…');
    }
    snippet
}

medium

Using direct string slicing (e.g., text[..cursor]) can cause panics if the indices are not valid UTF-8 character boundaries. To safely slice a string slice in Rust without panicking, use .get(..index) instead of direct slicing.

fn match_context_snippet(text: &str, start: usize, end: usize) -> String {
    let snippet_start = {
        let mut cursor = start;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text.get(..cursor).and_then(|s| s.char_indices().next_back()) {
                Some((index, _)) => cursor = index,
                None => break,
            }
        }
        cursor
    };
    let snippet_end = {
        let mut cursor = end;
        for _ in 0..MATCH_CONTEXT_RADIUS_CHARS {
            match text.get(cursor..).and_then(|s| s.chars().next()) {
                Some(ch) => cursor += ch.len_utf8(),
                None => break,
            }
        }
        cursor
    };
    let mut snippet = String::new();
    if snippet_start > 0 {
        snippet.push('…');
    }
    if let Some(segment) = text.get(snippet_start..snippet_end) {
        snippet.push_str(segment);
    }
    if snippet_end < text.len() {
        snippet.push('…');
    }
    snippet
}
References
  1. To safely slice a string slice in Rust without panicking on non-UTF-8 character boundaries while preserving byte semantics, use .get(..index) instead of direct slicing [..index] or converting to character iterators (e.g., chars().take()).
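
The difference between panicking and non-panicking slicing can be seen in a small standalone sketch. `str::get` and `str::is_char_boundary` are standard library methods; the helper `safe_prefix` is a hypothetical name introduced here for illustration and is not part of the PR.

```rust
/// Return at most `max_bytes` bytes of `text`, backing off to the nearest
/// earlier character boundary instead of panicking mid-character.
pub fn safe_prefix(text: &str, max_bytes: usize) -> &str {
    let mut end = max_bytes.min(text.len());
    // Offset 0 is always a boundary, so this loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// `str::get` returns None where `&text[..index]` would panic.
pub fn try_prefix(text: &str, index: usize) -> Option<&str> {
    text.get(..index)
}
```

In the function under review, the cursors are advanced with char_indices and len_utf8, so they should stay on character boundaries as long as the incoming start and end offsets are valid; the .get form is a defensive guard for the case where they are not.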

Comment on lines +274 to +275
for name in headers.keys() {
    let key = name.as_str().to_ascii_lowercase();

medium

In the http crate, HeaderName::as_str() is already guaranteed to be lowercase. Calling .to_ascii_lowercase() on it is redundant and causes unnecessary allocations.

Suggested change
- for name in headers.keys() {
-     let key = name.as_str().to_ascii_lowercase();
+ for name in headers.keys() {
+     let key = name.as_str().to_string();
References
  1. Avoid calling .to_lowercase() on strings or constants that are already known to be lowercase, as it causes redundant allocations. Query sets or maps directly using the borrowed lowercase string.

Comment on lines +260 to +269
pub(crate) fn derived_moderation_session_key(provider: &str, key_id: &str, body: &[u8]) -> String {
    let mut hasher = Sha256::new();
    hasher.update(body);
    let digest = hasher.finalize();
    let mut preview = String::with_capacity(16);
    for byte in digest.iter().take(8) {
        preview.push_str(&format!("{byte:02x}"));
    }
    format!("{provider}:{key_id}:content:{preview}")
}

medium

Formatting each byte of the digest in a loop using format! performs 8 separate string allocations. We can optimize this by converting the first 8 bytes of the digest to a u64 and formatting it once, which reduces allocations to exactly one.

Suggested change
- pub(crate) fn derived_moderation_session_key(provider: &str, key_id: &str, body: &[u8]) -> String {
-     let mut hasher = Sha256::new();
-     hasher.update(body);
-     let digest = hasher.finalize();
-     let mut preview = String::with_capacity(16);
-     for byte in digest.iter().take(8) {
-         preview.push_str(&format!("{byte:02x}"));
-     }
-     format!("{provider}:{key_id}:content:{preview}")
- }
+ pub(crate) fn derived_moderation_session_key(provider: &str, key_id: &str, body: &[u8]) -> String {
+     let mut hasher = Sha256::new();
+     hasher.update(body);
+     let digest = hasher.finalize();
+     let mut bytes = [0u8; 8];
+     bytes.copy_from_slice(&digest[..8]);
+     let val = u64::from_be_bytes(bytes);
+     let preview = format!("{val:016x}");
+     format!("{provider}:{key_id}:content:{preview}")
+ }
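
To check that the two hex-preview strategies agree, here is a standalone sketch that uses a fixed 32-byte array as a stand-in for the SHA-256 digest (no hashing crate is pulled in). The function names are hypothetical. The loop variant uses `std::fmt::Write` to append into the existing String, which is another way to avoid the per-byte temporary String.

```rust
use std::fmt::Write;

/// Per-byte formatting that appends in place: no temporary String per byte.
pub fn hex_preview_loop(digest: &[u8; 32]) -> String {
    let mut out = String::with_capacity(16);
    for byte in digest.iter().take(8) {
        write!(out, "{byte:02x}").expect("writing to a String cannot fail");
    }
    out
}

/// Single formatting call: read the first 8 bytes as a big-endian u64.
pub fn hex_preview_u64(digest: &[u8; 32]) -> String {
    let mut first = [0u8; 8];
    first.copy_from_slice(&digest[..8]);
    // Width 16 with zero padding keeps leading zero bytes visible.
    format!("{:016x}", u64::from_be_bytes(first))
}
```

Big-endian order matters here: it is what makes the u64 rendering match the byte-by-byte rendering, so previously derived session keys would keep the same preview.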

Comment on lines +247 to +251
    fn state_ban(&mut self, session_key: &str) -> bool {
        self.allowed.remove(session_key);
        self.banned.insert(session_key.to_string())
    }
}

medium

Calling session_key.to_string() on every call to state_ban causes an unnecessary allocation if the session is already present in self.banned. Checking self.banned.contains first avoids this allocation.

impl ModerationGateState {
    fn state_ban(&mut self, session_key: &str) -> bool {
        if self.banned.contains(session_key) {
            return false;
        }
        self.allowed.remove(session_key);
        self.banned.insert(session_key.to_string())
    }
}
References
  1. Avoid allocating keys (e.g., calling .to_string()) on every iteration of a loop when querying a map, especially on performance-critical hot paths or while holding a lock. Instead, query the map using a borrowed key (e.g., get_mut(key.as_ref())) and only allocate a new key when inserting a new entry for the first time. This reduces allocations from O(N) to O(distinct keys) and minimizes lock contention.

Comment on lines +30 to +34
CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_status_banned_at
    ON llm_moderation_banned_sessions(status, banned_at_ms DESC);

CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_key_id
    ON llm_moderation_banned_sessions(key_id, banned_at_ms DESC);

medium

The query in list_moderation_banned_sessions without a status filter orders by banned_at_ms DESC, id DESC. Without an index on (banned_at_ms DESC, id DESC), this query will require a full table scan and filesort as the table grows. Adding a composite index on these fields will significantly improve pagination performance.

CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_status_banned_at
    ON llm_moderation_banned_sessions(status, banned_at_ms DESC);

CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_banned_at
    ON llm_moderation_banned_sessions(banned_at_ms DESC, id DESC);

CREATE INDEX IF NOT EXISTS idx_llm_moderation_banned_sessions_key_id
    ON llm_moderation_banned_sessions(key_id, banned_at_ms DESC);

acking-you and others added 10 commits July 4, 2026 02:52
Exercise the moderation store against a real Postgres (Neon CI branch):
keyword bulk import with within-batch and cross-call ON CONFLICT dedup,
NULLIF note coercion, delete/RETURNING, banned-session capture with JSONB
body+headers, session_key conflict dedup, status-filtered pagination,
review/unban, and the runtime snapshot contract. Gated on TEST_POSTGRES_URL
and skipped when unset, matching the existing integration tests. Adds the
two moderation tables to the reset_test_db TRUNCATE list for isolation.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Keywords and request text may contain punctuation and arbitrary spacing.
Normalize both sides through a shared tokenizer before the Aho-Corasick
phrase match: lowercase, split into terms (alphanumeric runs for
space-delimited scripts, one term per ideographic character), drop
punctuation/whitespace as separators, and rejoin with a single space.

This makes matching insensitive to punctuation/spacing (`build a bomb`
matches `Build, a  bomb!`) and, because ideographs tokenize per character,
defeats separator-injection evasion (`习.近.平` still matches `习近平`).
Term-boundary alignment on the canonical form keeps `bomb` from firing
inside `bomber`. The Halfwidth & Fullwidth Forms block is excluded from
the ideographic set so fullwidth punctuation stays a separator.
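
A minimal sketch of the tokenizer described in this commit message, using only the standard library. It is illustrative rather than the PR's code: `is_ideograph` covers only the CJK Unified Ideographs block as a simplifying assumption, and the padded substring check stands in for the Aho-Corasick scan with term-boundary alignment.

```rust
/// Assumption: only U+4E00..=U+9FFF counts as ideographic in this sketch.
fn is_ideograph(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Canonical form: lowercase terms joined by single spaces. Alphanumeric
/// runs form one term, each ideograph is its own term, and everything
/// else (punctuation, whitespace) only separates terms.
pub fn canonicalize(text: &str) -> String {
    let mut terms: Vec<String> = Vec::new();
    let mut run = String::new();
    for c in text.chars().flat_map(char::to_lowercase) {
        if is_ideograph(c) {
            if !run.is_empty() {
                terms.push(std::mem::take(&mut run));
            }
            terms.push(c.to_string());
        } else if c.is_alphanumeric() {
            run.push(c);
        } else if !run.is_empty() {
            terms.push(std::mem::take(&mut run));
        }
    }
    if !run.is_empty() {
        terms.push(run);
    }
    terms.join(" ")
}

/// Term-aligned phrase match on the canonical forms: padding both sides
/// with spaces forces the hit to start and end on term boundaries.
pub fn phrase_matches(text: &str, keyword: &str) -> bool {
    let kw = canonicalize(keyword);
    if kw.is_empty() {
        return false;
    }
    let hay = format!(" {} ", canonicalize(text));
    hay.contains(&format!(" {kw} "))
}
```

Because each ideograph becomes its own term, separators injected between ideographs disappear in the canonical form, while `bomb` cannot match inside the single term `bomber`.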

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Apply verified findings from a full-feature review pass:

Hot path
- Add ModerationGate::is_active() and a shared enforce_moderation() helper
  so a dormant gate (no keywords, no bans) does zero work — no SHA-256
  session-key derivation, no text extraction, no scan.
- Collapse the triplicated ban-record + precheck logic across the kiro,
  codex, and direct-anthropic hooks into enforce_moderation(), removing
  the duplicated key derivation and MessagesRequest extraction.

Capture fidelity
- Store request_body_json/request_headers_json as TEXT, not JSONB, so the
  captured wire bytes are preserved verbatim for review instead of being
  reparsed/reordered; moderation_body_text() drops the extra JSON parse.
- Add the missing reviewed_at_ms >= 0 CHECK to migration 0036.

Matching
- Fold fullwidth ASCII (Ｂ→b, fullwidth punctuation→separator) in the
  tokenizer so fullwidth-form evasion is caught.
- Drop the unused ModerationMatch.pattern_index field.
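
The fullwidth folding mentioned under Matching can be sketched as a fixed code-point offset: the fullwidth forms U+FF01..=U+FF5E mirror ASCII U+0021..=U+007E at a distance of 0xFEE0. The function names are hypothetical, and mapping the ideographic space U+3000 to an ASCII space is an extra assumption of this sketch.

```rust
/// Map a fullwidth ASCII variant to its ASCII counterpart; other chars pass through.
pub fn fold_fullwidth(c: char) -> char {
    match c {
        // Fullwidth '！'..'～' sit exactly 0xFEE0 above ASCII '!'..'~'.
        '\u{FF01}'..='\u{FF5E}' => char::from_u32(c as u32 - 0xFEE0).unwrap_or(c),
        // Assumption: treat the ideographic space like a normal space.
        '\u{3000}' => ' ',
        _ => c,
    }
}

/// Fold a whole string; lowercasing and tokenizing would happen afterwards.
pub fn fold_str(text: &str) -> String {
    text.chars().map(fold_fullwidth).collect()
}
```

After folding, fullwidth letters go through the same lowercase and term-splitting steps as ordinary ASCII, and fullwidth punctuation becomes ordinary punctuation that the tokenizer treats as a separator.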

Admin & UI
- Cap keyword imports (count + per-keyword length).
- Banned-sessions review console: pagination (prev/next), a close control
  on the capture panel, clear the stale panel after a review, clear the
  error banner on a successful load, and add table header scope semantics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add diagram-driven English documentation:
- llm-access-core moderation: the tokenize → canonical-form → Aho-Corasick
  → term-boundary pipeline, worked English and CJK examples, and the
  Aho-Corasick complexity rationale (single O(n) scan over all keywords).
- llm-access moderation gate: the memory-vs-Postgres caching contract and
  the per-request enforce_moderation() decision flow (dormant → session
  key → precheck → scan → ban), as ASCII diagrams.
- Pointer comments at the three dispatch hook sites and the migration.

Comments only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…name

Replace the CJK moderation example `习近平` with the neutral, on-theme
`违禁词` ("banned word") in the module docs, the is_ideographic_char note,
and the tokenizer test. Same 3-ideograph shape, so the diagrams and the
separator-evasion demonstration are unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Classify moderation keywords under an 11-category risk taxonomy aligned to
the OpenAI usage policy (csam, sexual, weapons, extremism, drugs, criminal,
fraud, cyber, piracy, self_harm, jailbreak). A keyword may carry several
categories; a ban record captures the categories of the keyword that fired.

- Data model: new llm_moderation_categories table; category_codes on
  keywords and matched_categories on banned sessions (JSONB arrays, GIN
  indexed). Migration 0037 seeds the 11 categories and the 642 classified
  blocklist keywords (canonicalized through the real tokenizer). The
  classification was generated by range+override mapping and adversarially
  audited (only 1/642 corrected).
- Matcher: ModerationMatcher carries per-keyword categories; a hit returns
  them, and the gate records them on the ban.
- Store trait + Postgres: category list/add/delete (delete refuses while a
  keyword still references the code), keyword import with categories.
- Admin API: /moderation/categories list/add/delete; keyword import accepts
  a validated batch-level category set.
- Frontend: a Categories tab (manage the taxonomy), category multi-select on
  import, and severity-colored category badges on the keyword list, the
  banned-session list, and the capture detail.

The client-facing rejection stays generic (no keyword/category leak); the
admin console shows the full keyword + categories.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Review of the hit-scoped unban surfaced correctness bugs; fixes:

1. [blocker] Suppression bypass: the resume loop advanced by match_start
   (find_after, exclusive on start), which discards EVERY hit at that
   offset — including a DISTINCT unsuppressed keyword sharing the start
   (e.g. `bomb` and `bomb making`, or `习`/`习近`). Unbanning the shorter
   one silently masked the longer one. Replace the position-based resume
   scan with ModerationMatcher::find_accepted: one overlapping scan from 0
   that skips only suppressed hit_keys, so co-located keywords are each
   evaluated. Drops the unsafe resume-cursor scan-skip (it could also miss
   a longer keyword starting before a cursor); per-hit content-scoping is
   preserved via the prefix hash folded into hit_key. Regression test added.

2. [major] Drop the partial UNIQUE index on (session_key) WHERE
   status='banned' — it enforced one active ban per session, contradicting
   the multi-hit model and 500ing when re-banning a reviewed hit. Per-hit
   uniqueness is already covered by hit_key UNIQUE.

3. [minor] record_moderation_banned_session now uses ON CONFLICT (hit_key)
   DO NOTHING so a distinct new hit in an already-banned session is captured
   rather than silently swallowed by the dropped index's conflict.
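
The suppression fix in item 1 can be illustrated with a simplified model: collect every overlapping hit, then skip only the hits whose identity is suppressed. Everything here is a sketch under assumptions: `Hit` reduces hit_key to (keyword, start, end) and omits the session and prefix hash, the search is naive rather than Aho-Corasick, and `first_unsuppressed_hit` is a hypothetical name, not the signature of ModerationMatcher::find_accepted.

```rust
use std::collections::HashSet;

/// Simplified hit identity; the real hit_key also folds in session and a content hash.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Hit {
    pub keyword: String,
    pub start: usize,
    pub end: usize,
}

/// Every occurrence of every keyword, overlaps included, ordered by (start, end).
pub fn all_hits(text: &str, keywords: &[&str]) -> Vec<Hit> {
    let mut hits = Vec::new();
    for &kw in keywords {
        if kw.is_empty() {
            continue;
        }
        let mut from = 0;
        while let Some(rel) = text[from..].find(kw) {
            let start = from + rel;
            hits.push(Hit { keyword: kw.to_string(), start, end: start + kw.len() });
            // Advance by one character so overlapping occurrences are kept.
            from = start + text[start..].chars().next().map_or(1, char::len_utf8);
        }
    }
    hits.sort_by_key(|h| (h.start, h.end));
    hits
}

/// Skip by hit identity, not by position: a suppressed `bomb` at offset 0
/// must not hide an unsuppressed `bomb making` at the same offset.
pub fn first_unsuppressed_hit(
    text: &str,
    keywords: &[&str],
    suppressed: &HashSet<Hit>,
) -> Option<Hit> {
    all_hits(text, keywords)
        .into_iter()
        .find(|hit| !suppressed.contains(hit))
}
```

A position-based resume would restart strictly after offset 0 once `bomb` was suppressed and never evaluate `bomb making`; filtering on the hit itself keeps co-located keywords independent.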

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Document, in code, why moderation suppression must skip by hit identity
rather than by scan position:
- A new module-doc section "Hit-scoped unban" spells out what hit_key is
  (session + keyword + offsets + preceding-content hash), why the content
  prefix makes suppression fail-closed on any content change, and a
  WRONG-vs-RIGHT diagram showing how position-based skipping lets a distinct
  keyword sharing a suppressed hit's start offset (bomb / bomb making) slip
  through — the bypass fixed by find_accepted.
- find_accepted's rustdoc explains it exists precisely to avoid that bypass.

Comments only; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@acking-you merged commit 82b7075 into master on Jul 4, 2026
3 checks passed